What are they?
Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise
How do we make them?
In R, with as.factor():
lm(y ~ as.factor(category_var))
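A toy sketch of what as.factor() does under the hood (the variable name category_var is illustrative, not from the data used below): lm() expands a factor into 0/1 dummy columns in the design matrix, dropping the first level as the reference.

```r
# A character variable becomes a set of 0/1 dummy columns via the design matrix
category_var = c("a", "b", "c", "a", "b")
X = model.matrix(~ as.factor(category_var))
colnames(X)
# The first category ("a") is the reference; "b" and "c" each get a dummy column
X[, "as.factor(category_var)b"]  # 1 when the category is "b", 0 otherwise
```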
What do they do?
summary(lm(INCEARN ~ sex, acs_data))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 143428.21 2350.162 61.02908 0.000000e+00
## sexMale 60556.48 2864.986 21.13675 4.756861e-97
We can include all categories and drop the intercept
summary(lm(INCEARN ~ -1 + sex, acs_data))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## sexFemale 143428.2 2350.162 61.02908 0
## sexMale 203984.7 1638.562 124.49010 0
How does our interpretation change?
## Estimate Std. Error t value Pr(>|t|)
## sexFemale 143428.2 2350.162 61.02908 0
## sexMale 203984.7 1638.562 124.49010 0
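The no-intercept coefficients above are simply the group means. A simulated sketch (the values 140 and 200 are made up, not the ACS estimates) makes the equivalence explicit:

```r
# With -1 (no intercept), each coefficient equals the mean outcome in that group
set.seed(1)
g = factor(rep(c("Female", "Male"), each = 50))
y = ifelse(g == "Male", 200, 140) + rnorm(100, sd = 10)
coefs = coef(lm(y ~ -1 + g))
group_means = tapply(y, g, mean)
all.equal(unname(coefs), as.numeric(group_means))  # TRUE
```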
We can consider a case where there are several categories:
Brexit Vote: What are regional differences in vote for Brexit?
Five regions: Southern England, Midlands, Northern England, Wales, and Scotland
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.61020359 7.7829764 5.7317665 2.048832e-08
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5The Midlands 8.57606186 1.2532253 6.8431923 3.179826e-11
## Region5Northern England 6.35984691 1.3248565 4.8004045 2.293393e-06
## Region5Wales 2.30867033 2.0441434 1.1294072 2.594499e-01
## Region5Scotland -11.60364022 1.8478179 -6.2796448 9.422928e-10
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.18626546 7.7855372 6.8314188 3.420699e-11
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Northern England -2.21621496 1.5559745 -1.4243260 1.551859e-01
## Region5Wales -6.26739154 2.2030506 -2.8448696 4.687357e-03
## Region5Scotland -20.17970209 2.0149071 -10.0152022 4.494699e-21
## Region5Southern England -8.57606186 1.2532253 -6.8431923 3.179826e-11
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.97005050 7.3635671 6.9219239 1.946734e-11
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Wales -4.05117658 2.1752891 -1.8623624 6.333598e-02
## Region5Scotland -17.96348713 1.9097366 -9.4062643 5.303722e-19
## Region5Southern England -6.35984691 1.3248565 -4.8004045 2.293393e-06
## Region5The Midlands 2.21621496 1.5559745 1.4243260 1.551859e-01
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.91887392 7.6346802 6.1454930 2.043370e-09
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Scotland -13.91231055 2.4939139 -5.5785047 4.660849e-08
## Region5Southern England -2.30867033 2.0441434 -1.1294072 2.594499e-01
## Region5The Midlands 6.26739154 2.2030506 2.8448696 4.687357e-03
## Region5Northern England 4.05117658 2.1752891 1.8623624 6.333598e-02
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.00656337 7.2248567 4.5684731 6.681290e-06
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Southern England 11.60364022 1.8478179 6.2796448 9.422928e-10
## Region5The Midlands 20.17970209 2.0149071 10.0152022 4.494699e-21
## Region5Northern England 17.96348713 1.9097366 9.4062643 5.303722e-19
## Region5Wales 13.91231055 2.4939139 5.5785047 4.660849e-08
How do you interpret this?
summary(lm(Pct_Lev ~ -1 + Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Scotland 33.00656337 7.2248567 4.5684731 6.681290e-06
## Region5Southern England 44.61020359 7.7829764 5.7317665 2.048832e-08
## Region5The Midlands 53.18626546 7.7855372 6.8314188 3.420699e-11
## Region5Northern England 50.97005050 7.3635671 6.9219239 1.946734e-11
## Region5Wales 46.91887392 7.6346802 6.1454930 2.043370e-09
In R: you can shift the reference category using the following:
levels = c('Scotland', 'Southern England', 'The Midlands', 'Northern England', 'Wales')
brexit$Region5 = factor(brexit$Region5, levels = levels)
The first category is dropped (in this case, Scotland)
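A small self-contained sketch of the same releveling idea (the region names here are abbreviated for illustration):

```r
# Factor levels default to alphabetical order; the first level is the
# reference category that lm() drops
region = factor(c("Wales", "Scotland", "Scotland", "Wales", "Midlands"))
levels(region)   # "Midlands" "Scotland" "Wales"

# relevel() (or factor() with an explicit levels vector) moves a chosen
# category to the front, making it the new dropped reference
region2 = relevel(region, ref = "Scotland")
levels(region2)  # "Scotland" "Midlands" "Wales"
```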
Hooke’s law is a law of physics that states that the force (\(F\)) needed to extend or compress a spring by some distance \(x\) scales linearly with respect to that distance.
So the length of a spring is linearly proportional to the force applied. For a given weight \(X_i\) on a spring, the length \(Y_i\) is:
\[Y_i = \alpha + \beta X_i + \epsilon_i\] for \(i\) in \(1 \ldots n\)
This is a statistical model:
We use observed data to estimate the parameters \(\alpha, \beta, \sigma^2\)
The statistical model for the regression makes certain assumptions about the process generating the data.
To find a line, we create a statistical model for a linear equation that describes the process generating the data.
\[Y_i = \alpha + \beta X_i + \epsilon_i\]
deterministic (not random) \(\alpha + \beta X_i\)
stochastic (random error) \(\epsilon_i\)
\(\alpha\): parameter for the intercept (fixed for all cases)
\(\beta\): parameter for the slope (fixed for all cases)
\(X_i\): value of independent variable (input into the equation)
\(Y_i\): value of dependent variable (determined by the equation)
\(\epsilon_i\): random error for observation \(i\)
x = rnorm(100)
y = 3 + x * 0.5
plot(x, y, main = bquote(Y[i] == alpha + beta * X[i]))
abline(a = 3, b = 0.5)
y_star = 3 + x * 0.5 + rnorm(100)
plot(x, y_star, main = bquote(Y[i] == alpha + beta * X[i] + epsilon[i]))
abline(a = 3, b = 0.5)
This regression model:
\[Y_i = \alpha + \beta X_i + \epsilon_i\]
is linked to the least-squares calculations we did the last two weeks. The regression model can be estimated with least squares:
\[Y_i = \hat{\alpha} + \hat{\beta} X_i + e_i\]
where \(Var(e) = \widehat{\sigma^2}\)
Differences: the parameters \(\alpha, \beta, \sigma^2\) are fixed but unknown, while \(\hat{\alpha}, \hat{\beta}, \widehat{\sigma^2}\) are estimates computed from data; the errors \(\epsilon_i\) are unobserved random variables, while the residuals \(e_i\) are observed.
Another key assumption: (if \(X\) is also a random variable) \(X \mathrel{\unicode{x2AEB}} \epsilon\)
If a heavier weight (large \(X_i\)) increased the likelihood of a large \(\epsilon_i\), and a lighter weight (small \(X_i\)) increased the likelihood of a small \(\epsilon_i\), then we would over-estimate \(\beta\).
Model equation is correctly specified (correct functional forms, all relevant variables are included). \(Y_i = \alpha + \beta X_i + \epsilon_i\) is the true data-generating function
\(\epsilon_i\) are independently, identically distributed (i.i.d.), \(E(\epsilon_i) = 0\) and variance of \(\sigma^2\). homoskedastic vs. heteroskedastic
\(\epsilon_i\) is independent of \(X_i\): \(X \mathrel{\unicode{x2AEB}} \epsilon\)
We cannot empirically verify these assumptions.
When using least squares to estimate this model, we mathematically impose that the residuals \(e_i\) are orthogonal to \(x_i\) and that \(\bar{e} = 0\). But these are mechanical properties of the fit, not evidence that the corresponding assumptions about \(\epsilon_i\) hold.
If these assumptions are a good model for the data we observe, then with large \(n\), we can estimate our parameters \(\alpha, \beta, \sigma^2\) using least squares.
In the multivariate situation, the statistical model for regression is:
\[Y = \mathbf{X\beta} + \epsilon\]
Each row of this becomes:
\[Y_i = \mathbf{X_i\beta} + \epsilon_i\]
Statistical Assumptions (statistical model):
Mathematical Assumptions (least squares):
Statistical Assumptions (statistical model):
We want to estimate \(p + 1\) parameters… (what are they?)
\(\beta\) estimated by \(\widehat{\beta}\) using least squares:
\[\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
Note: \(\widehat{\beta}\) is a random vector; each element is a random variable. Why? Recall \(Y = \mathbf{X\beta} + \epsilon\)
By assumption: \(Y = \mathbf{X\beta} + \epsilon\), so:
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X\beta} + \epsilon)\)
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
So:
\(E(\widehat{\beta}|X) = E(\mathbf{\beta}|\mathbf{X}) + E((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon|\mathbf{X})\)
\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon|\mathbf{X})\)
And because, by assumption, \(X \mathrel{\unicode{x2AEB}} \epsilon\)
\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon)\)
And because, by assumption, \(E(\epsilon) = 0\)
\(E(\widehat{\beta}|X) = \mathbf{\beta}\)
Thus, if our model assumptions are correct, \(\widehat{\beta}\) is an unbiased estimate of parameter \(\beta\)
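A quick Monte Carlo sketch illustrates unbiasedness (the values \(\alpha = 3\), \(\beta = 0.5\) are made up for illustration): over many simulated datasets that satisfy the assumptions, the least-squares slope averages out to the true \(\beta\).

```r
# Simulate many datasets from Y = alpha + beta*X + eps and check that
# the least-squares slope estimates are centered on the true beta = 0.5
set.seed(42)
alpha = 3; beta = 0.5; n = 100
beta_hats = replicate(2000, {
  x = rnorm(n)
  y = alpha + beta * x + rnorm(n)
  coef(lm(y ~ x))[2]
})
mean(beta_hats)  # close to 0.5
```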
We need to find a way of finding the uncertainty in our estimate \(\widehat{\beta}\)
\[\scriptsize{\begin{pmatrix} Var(\hat\beta_1) & Cov(\hat\beta_2,\hat\beta_1) & Cov(\hat\beta_3,\hat\beta_1) &\ldots & Cov(\hat\beta_p,\hat\beta_1) \\ Cov(\hat\beta_1,\hat\beta_2) & Var(\hat\beta_2) & Cov(\hat\beta_3,\hat\beta_2) & \ldots & Cov(\hat\beta_p,\hat\beta_2) \\ Cov(\hat\beta_1,\hat\beta_3) & Cov(\hat\beta_2,\hat\beta_3) & Var(\hat\beta_3) & \ldots & Cov(\hat\beta_p,\hat\beta_3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Cov(\hat\beta_1,\hat\beta_p) & Cov(\hat\beta_2,\hat\beta_p) & Cov(\hat\beta_3, \hat\beta_p) & \ldots & Var(\hat\beta_p)\end{pmatrix}}\]
How do we use the variance-covariance matrix?
The square root of each diagonal element (a variance) gives the standard error of the corresponding parameter estimate in \(\widehat{\beta}\) (used for hypothesis testing)
The off-diagonal elements help answer questions like: is \(\beta_2 + \beta_3 \neq 0\)? We need \(Cov(\hat{\beta_2}, \hat{\beta_3})\) to get \(Var(\hat{\beta_2} + \hat{\beta_3})\).
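To see the off-diagonal term in action, here is a sketch with simulated data (the regressors x2, x3 and their coefficients are made up for illustration). Since \(Var(\hat\beta_2 + \hat\beta_3) = Var(\hat\beta_2) + Var(\hat\beta_3) + 2\,Cov(\hat\beta_2, \hat\beta_3)\), we pull all three pieces out of vcov():

```r
# Correlated regressors make the covariance term matter
set.seed(7)
x2 = rnorm(200)
x3 = 0.5 * x2 + rnorm(200)
y = 1 + x2 + x3 + rnorm(200)
fit = lm(y ~ x2 + x3)
V = vcov(fit)  # estimated variance-covariance matrix of the coefficients

# Variance of the sum of the two slope estimates
var_sum = V["x2", "x2"] + V["x3", "x3"] + 2 * V["x2", "x3"]
se_sum = sqrt(var_sum)  # standard error of (beta2_hat + beta3_hat)
```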
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\), So:
\[cov(\widehat{\beta}|X) = E((\widehat{\beta} - \beta)(\widehat{\beta} - \beta)' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon)(\epsilon'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}) | X)\]
\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
What really matters here is: \(E(\epsilon\epsilon'|X)\)
\[E(\epsilon\epsilon'|X) = \sigma^2\mathbf{I}_{n\times n}\]
\(\epsilon\epsilon'\) is \(n \times n\) matrix with \(ij\)th elements \(\epsilon_i\epsilon_j\)
Taking the expectation:
\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2\mathbf{I}_{n\times n}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
\[ = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
\[cov(\widehat{\beta}|X) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]
Can we start computing standard errors for \(\widehat{\beta}\) now? What is missing?
\(\hat{\sigma^2} = \frac{\sum\limits_i^n e_i^2}{n - p}\)
and \(E[\hat{\sigma^2} | X] = \sigma^2\) (unbiased)
so, we estimate the variance-covariance matrix:
\(\widehat{cov}(\widehat{\beta}) = \widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\)
The variance of \(\hat{\beta_p}\) (the corresponding diagonal element of \(\widehat{cov}(\widehat{\beta})\)) is
\(Var(\hat{\beta_p}) = \frac{\hat{\sigma^2}}{\sum\limits_i^n (X^*_{i} - \overline{X^*})^2}\)
So the standard error is:
\(SE(\hat{\beta_p}) = \sqrt{Var(\hat{\beta_p})} = \frac{\widehat{\sigma}}{\sqrt{n}\sqrt{Var(X_p^*)}}\)
99 in control, 1 in treatment
## [1] 0.01
50 in control, 50 in treatment
## [1] 0.2525253
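The two numbers above are consistent with the sample variance of a 0/1 treatment indicator (a reconstruction; the original code is not shown in the source). A balanced design maximizes \(Var(X)\), which shrinks the standard error of \(\hat{\beta}\):

```r
# 99 in control (0), 1 in treatment (1)
x_unbalanced = c(rep(0, 99), rep(1, 1))
var(x_unbalanced)
## [1] 0.01

# 50 in control, 50 in treatment
x_balanced = c(rep(0, 50), rep(1, 50))
var(x_balanced)
## [1] 0.2525253
```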
In sum:
For these to be correct, we must assume:
If any assumption is wrong, \(\widehat{\beta}\) may be biased and the standard errors will be incorrect.
When do things go wrong?
\(H_0: \beta = \beta^*\) \(H_a: \beta \neq \beta^*\)
Typically, we take \(\beta^* = 0\), but in principle it can be any value.
Because we estimate the standard errors, use a \(t\) statistic:
\(t_{n-p} = \frac{\hat{\beta} - \beta^*}{se(\hat{\beta})}\)
Then obtain a \(p\) value (one-sided or two-sided, depending on \(H_a\))
If we are using a \(t\)-test, don't we need the sampling distribution of \(\hat{\beta}\) to be normal? Yes.
Two approaches
assume normality: If we assume \(\epsilon\) is distributed normally \(N(mean = 0, variance = \sigma^2)\), then sampling distribution of \(\hat{\beta}\) is always normal.
Two approaches
asymptotic normality: if \(\epsilon\) is not normally distributed, then under some plausible assumptions about the data, and assuming observations are independent, the sampling distribution of \(\hat{\beta}\) converges to a normal distribution as \(n\to\infty\) by the central limit theorem. In practice this works when \(n\) is moderately large and the data are not highly non-normal. When it fails, we can always bootstrap.
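A minimal sketch of the bootstrap option, using simulated data (the values \(\alpha = 3\), \(\beta = 0.5\) are made up): resample rows with replacement, refit, and take the standard deviation of the refitted slopes as the standard error.

```r
# Pairs bootstrap for the standard error of the slope
set.seed(99)
n = 100
x = rnorm(n)
y = 3 + 0.5 * x + rnorm(n)
boot_betas = replicate(1000, {
  idx = sample(n, replace = TRUE)   # resample observations (rows)
  coef(lm(y[idx] ~ x[idx]))[2]      # refit and keep the slope
})
sd(boot_betas)  # bootstrap standard error of the slope
```

With well-behaved data like this, the bootstrap SE should land close to the analytic SE from summary(lm(y ~ x)).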
Regress Brexit Vote on Rainfall
#Make design-matrix
X = cbind(1, brexit$total_precip_polling)
#Make outcome matrix
Y = brexit$Pct_Lev
#Solve for Beta
beta_mat = solve(t(X) %*% X) %*% t(X) %*% Y
#Create y_hat
y_hat = X %*% beta_mat
#Create residuals
e = Y - y_hat
#Estimate sigma squared
sigma2_hat = sum(e^2) / (nrow(X) - ncol(X))
#Estimate variance-covariance
var_covar = sigma2_hat * solve(t(X) %*% X)
var_covar
## [,1] [,2]
## [1,] 0.39786504 -0.012415453
## [2,] -0.01241545 0.001306648
#beta standard errors:
beta_ses = var_covar %>% diag %>% sqrt
#get coefficient, standard errors, and t-stats
cbind(beta_mat, beta_ses, beta_mat / beta_ses)
## beta_ses
## [1,] 54.1460994 0.63076544 85.841894
## [2,] -0.1058053 0.03614759 -2.927036
#check against `lm`
lm(Pct_Lev ~ total_precip_polling, brexit) %>% summary %>% .$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.1460994 0.63076544 85.841894 5.311451e-250
## total_precip_polling -0.1058053 0.03614759 -2.927036 3.628998e-03
Let’s interpret the hypothesis test:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.1460994 0.63076544 85.841894 5.311451e-250
## total_precip_polling -0.1058053 0.03614759 -2.927036 3.628998e-03
For the intercept: is it “significant”? Is this informative in any meaningful way?
For the slope: is it “significant”? Is this informative in any meaningful way?
Does the regression model apply in this context?
Is the model: \(Brexit \ Leave \ \%_i= \alpha + \beta \ mmRainfall_i + \epsilon_i\) correct?
Is \(\epsilon\) i.i.d. with mean \(0\) and variance \(\sigma^2\)?
Is \(\epsilon_i\) independent of \(mmRainfall_i\)?
Given these assumptions (sometimes called Gauss-Markov assumptions)
we can estimate \(\beta\) using \(\widehat{\beta}\) from least squares, and this estimate will be
But are these assumptions reasonable (compared to Neyman causal model)?